[RVM 2021] Robust High-Resolution Video Matting with Temporal Guidance: ByteDance; temporal modeling (ConvGRU); multi-task training

[SparseInst 2022] Sparse Instance Activation for Real-Time Instance Segmentation: CASIA (Institute of Automation); the overall design resembles DETR

Robust High-Resolution Video Matting with Temporal Guidance

  1. Motivation
    • human video matting
      • used for background replacement
      • existing methods are unstable and produce artifacts
    • performance goals
      • robust
      • real-time
        • 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti
      • high-resolution
    • use a recurrent structure instead of frame-by-frame processing: a temporal network gives better matting quality
    • propose a novel training strategy: train the matting and segmentation tasks jointly, making the model more robust
  2. Arguments
    • matting formulation recap
      • $I = \alpha F + (1-\alpha)B$
    • matting methods
      • Trimap-based matting: the most classical setting; requires an extra trimap prior, and usually does no classification, only foreground/background separation
      • Background-based matting: no longer needs a trimap prior, but requires a background image as the prior instead
      • Segmentation: plain binary semantic segmentation; works reasonably on human foregrounds, but the background tends to show various artifacts, so it is unstable
      • Auxiliary-free matting: architectures that need no extra inputs; MODNet focuses on portraits, while this paper targets the whole person
      • Video matting:
        • MODNet uses predictions from neighboring frames to suppress each other's artifacts, but it is essentially still image-independent
        • BGM stacks multiple frames as extra input channels
      • Recurrent architecture: ConvLSTM/ConvGRU
      • High-resolution matting
        • Patch-based refinement: reduces the working resolution to save compute for the high-resolution task, then refines only selected patches
        • Deep guided filter: trainable, modular, upsamples the low-resolution result to high resolution end-to-end
    • use a temporal structure
      • temporal information boosts both quality and robustness
      • watching the background change over time lets the model learn background cues more robustly and precisely
    • introduce a new training strategy
      • most matting datasets are synthetic, and even the data pipeline composites foregrounds onto new backgrounds to enlarge the sample pool; such images look fake, creating a domain gap with real scenes and hurting generalization
      • some methods tackle the fake-image problem by pre-training on segmentation, adversarial training on real images, etc., but those are multi-step pipelines
      • training matting & segmentation jointly solves it in one shot, with no extra adaptation steps
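The compositing equation $I = \alpha F + (1-\alpha)B$ recapped above can be checked numerically; a toy numpy example (single-channel 2x2 "frame", made-up values):

```python
import numpy as np

# Toy 2x2 frame: alpha blends foreground F over background B per pixel.
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])   # alpha matte in [0, 1]
F = np.full((2, 2), 0.8)          # foreground intensity
B = np.full((2, 2), 0.2)          # background intensity

I = alpha * F + (1 - alpha) * B   # I = alpha*F + (1-alpha)*B
print(I)                          # fully-opaque pixel -> F, fully-transparent -> B
```

At alpha=1 the pixel is pure foreground (0.8), at alpha=0 pure background (0.2); matting is the inverse problem of recovering alpha (and F) from I.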
  3. Method
    • model architecture overview
      • encoder: encodes individual frames' features; MobileNetV3 / ResNet50
      • recurrent decoder: aggregates temporal information
      • a Deep Guided Filter module: high-resolution upsampling
    • Feature-Extraction Encoder
      • MobileNetV3-Large + LR-ASPP module
      • the last block uses dilated convolutions
    • Recurrent Decoder
      • ConvGRU at multiple scales
      • bottleneck block: at the 1/16 scale
        • placed after the LR-ASPP
        • then a ConvGRU with an identity path (channels are split: half pass through unchanged, half go through the GRU)
        • then bilinear 2x upsampling
      • Upsampling block: at the 1/8, 1/4, and 1/2 scales
        • one per resolution stage
        • first merge (concat) the previous stage's feature
        • then avg pooling and conv-bn-relu to transform the feature
        • then a ConvGRU with an identity path
        • then bilinear 2x upsampling
      • Output block: at the full resolution
        • makes the final prediction
        • first merge
        • then [conv3x3-bn-relu] x2
        • then conv1x1 heads: 1-channel alpha / 3-channel foreground / 1-channel segmentation
    • Deep Guided Filter Module
      • for high-resolution videos such as 4K and HD
      • first downsample the input by a factor s
      • then run it through the network
      • finally, four pieces of information, namely the network's two outputs (alpha & foreground), the output block's hidden features, and the high-resolution original frame, are all fed to the DGF to produce the high-resolution alpha and foreground
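The split-channel ConvGRU update used throughout the decoder can be sketched in numpy. This is a minimal single-timestep toy, not the paper's implementation: 1x1 channel-mixing stands in for the real convolutions, weights are random, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

def conv1x1(x, w):
    # Stand-in for the decoder's convs: a 1x1 conv is just channel mixing.
    return np.einsum('chw,oc->ohw', x, w)

def convgru_step(x, h, Wz, Wr, Wh):
    # Single ConvGRU update: standard GRU gates with convs instead of dense layers.
    xh = np.concatenate([x, h], axis=0)
    z = sigmoid(conv1x1(xh, Wz))                                   # update gate
    r = sigmoid(conv1x1(xh, Wr))                                   # reset gate
    cand = np.tanh(conv1x1(np.concatenate([x, r * h], axis=0), Wh))
    return (1 - z) * h + z * cand                                  # new hidden state

C, H, W = 8, 4, 4                        # toy channel/spatial sizes
x = rng.standard_normal((C, H, W))       # feature map entering the block
h = np.zeros((C // 2, H, W))             # recurrent state: half the channels

# Identity-path split as in the decoder: the first half of the channels
# bypasses the GRU unchanged, the second half is updated recurrently,
# and the two halves are re-concatenated afterwards.
x_id, x_gru = x[:C // 2], x[C // 2:]
Wz = rng.standard_normal((C // 2, C)) * 0.1
Wr = rng.standard_normal((C // 2, C)) * 0.1
Wh = rng.standard_normal((C // 2, C)) * 0.1
h = convgru_step(x_gru, h, Wz, Wr, Wh)
out = np.concatenate([x_id, h], axis=0)  # [C, H, W], then bilinear 2x upsampling
print(out.shape)
```

The split keeps half the channels purely feed-forward (cheap, no state), while the GRU half carries temporal information across frames.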
  4. Experiments
    • training details
      • progressive learning: the model gradually sees longer sequences and higher resolutions
      • losses:
        • matting loss (alpha / fg): L1 & pyramid Laplacian loss, plus an additional temporal coherence loss
        • segmentation loss: BCE
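The matting loss terms listed above can be sketched in numpy. A toy version with only the L1 and temporal-coherence pieces (the Laplacian pyramid term is omitted for brevity; shapes, data, and the exact coherence formulation here are illustrative, not the paper's):

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).mean()

def temporal_coherence(pred, gt):
    # Penalize flicker: make the frame-to-frame change of the prediction
    # match that of the ground truth (an L2 penalty on alpha differences).
    dp = np.diff(pred, axis=0)   # [T-1, H, W] temporal differences
    dg = np.diff(gt, axis=0)
    return ((dp - dg) ** 2).mean()

T, H, W = 4, 8, 8                # toy clip: 4 frames of 8x8 alpha mattes
rng = np.random.default_rng(1)
alpha_pred = rng.random((T, H, W))
alpha_gt = rng.random((T, H, W))

loss = l1(alpha_pred, alpha_gt) + temporal_coherence(alpha_pred, alpha_gt)
print(float(loss))
```

The coherence term is zero for any prediction whose per-frame changes exactly track the ground truth, which is what discourages frame-to-frame flicker.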

Sparse Instance Activation for Real-Time Instance Segmentation

  1. Motivation
    • fully convolutional real-time instance segmentation
    • prior instance segmentation is usually tied to object detection
      • dense anchors
      • receptive fields fixed by the fixed anchors
      • multi-level prediction
      • ROI-Align is unfriendly to mobile/embedded devices
      • NMS is time-consuming
    • this paper
      • a sparse set of activation maps: analogous to DETR's 100 proposals
      • instance-level features obtained from the attention maps
      • Hungarian matching between proposed instances and ground truth, which removes the need for NMS and yields sparse predictions
      • 40 FPS and 37.9 AP on the COCO benchmark
      • repo: https://github.com/hustvl/SparseInst
  2. Arguments
    • this paper
      • IAM: instance activation maps, a sparse set, motivated by CAM
        • pixel-level: in contrast to boxes, which still contain background
        • global context & single-level
        • simple ops: avoids the unavoidable loops of ROI-Align/NMS
        • sparse bipartite supervision: inhibits the redundant predictions, thus avoiding NMS
      • recognition and segmentation: the downstream tasks run on top of IAM's instance features
    • overall structure
      • encoder: backbone + PPM, giving 1/8-scale fused features
      • decoder: multi-branch
        • instance branch: IAM
        • mask branch: semantic-segmentation-style mask features
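The bipartite (Hungarian-style) matching that replaces NMS can be illustrated with a tiny brute-force matcher. This is a stand-in sketch: the cost values below are made up (in SparseInst the entries would combine classification and mask/dice costs), and real implementations use an O(n^3) solver rather than enumerating permutations:

```python
import itertools
import numpy as np

def bipartite_match(cost):
    # Brute-force one-to-one assignment minimizing total cost; only viable
    # for tiny matrices, but it shows the objective Hungarian matching solves.
    n_pred, n_gt = cost.shape
    best, best_perm = float('inf'), None
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[p, g] for g, p in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# Rows = predicted instances, columns = ground-truth instances.
cost = np.array([[0.9, 0.1, 0.8],
                 [0.2, 0.7, 0.6],
                 [0.5, 0.4, 0.05]])
perm, total = bipartite_match(cost)
print(perm, total)   # perm[g] = index of the prediction assigned to gt g
```

Because each ground-truth instance is matched to exactly one prediction, every unmatched prediction is simply supervised toward "no object", so duplicates are suppressed during training and no NMS is needed at inference.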
  3. Method
    • IAM: Instance Activation Maps
      • basic assumption: the features produced by the encoder are redundant
      • IAM ops
        • an identity branch passing through the raw features, [b, h, w, d]
        • a feature-selection branch (conv + sigmoid + normalization), [b, h, w, N]
        • a matrix multiplication between the two branches, [b, N, d]: the selection branch provides N forms of spatial reweighting over the raw features, which serve as the final attention proposals
      • downstream tasks: recognition and segmentation, predicting per instance
        • kernel
        • class
        • score
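The IAM op above boils down to sigmoid-normalized attention maps multiplied against the feature map. A minimal numpy sketch (all sizes, the random inputs, and the exact normalization choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, w, d, N = 2, 16, 16, 32, 8           # toy sizes; N = number of proposals

feat = rng.standard_normal((b, h, w, d))   # encoder features (identity branch)
logits = rng.standard_normal((b, h, w, N)) # conv output of the selection branch

# Sigmoid activation, then normalize each map over the spatial positions
# so every proposal is a spatial weighting that sums to 1.
A = 1 / (1 + np.exp(-logits))
A = A / A.sum(axis=(1, 2), keepdims=True)

# Each of the N maps spatially reweights the features -> instance features.
A_flat = A.reshape(b, h * w, N)                   # [b, hw, N]
F_flat = feat.reshape(b, h * w, d)                # [b, hw, d]
inst = np.einsum('bpn,bpd->bnd', A_flat, F_flat)  # [b, N, d]
print(inst.shape)
# Downstream linear heads then predict class / score / mask kernel per row.
```

Each row of `inst` is one proposal's feature vector, i.e. a weighted average of pixel features under that proposal's activation map, with no ROI cropping involved.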